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Abstract. We present the first sub-quadratic time algorithm that with high probability correctly recon- 
structs phylogenetic trees for short sequences generated by a Markov model of evolution. Due to rapid 
expansion in sequence databases, such very fast algorithms are becoming necessary. Other fast heuristics 
have been developed for building trees from very large alignments [2011] . but they lack theoretical perfor- 
mance guarantees. Our new algorithm runs in 0(n 1+7 ' ff ^ log 2 n) time, where 7 is an increasing function of 

an upper bound on the branch lengths in the phylogeny, the upper bound g must be below| — i/| « 0.15, 
and 7(3) < 1 for all g. For phylogenies with very short branches, the running time of our algorithm 
is close to linear. For example, if all branch lengths correspond to a mutation probability of less than 
0.02, the running time of our algorithm is roughly 0(n 1,2 log 2 n). Via a prototype and a sequence of 
large-scale experiments, we show that many large phylogenies can be reconstructed fast, without com- 
promising reconstruction accuracy. 
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1 Introduction 



Phylogenetic reconstruction is a core bioinformatics problem. Existing algorithms for the problem often require 
very long running times, particularly when the number of sequences is large; this problem is especially acute 
for traditionally slow, yet accurate, methods like maximum likelihood. As biologists start to need trees for 
hundreds of thousands of taxa [10] . even traditionally faster distance methods, such as neighbour joining, 
become unacceptably slow. Researchers have developed sub-quadratic time heuristics |20)lj . but they lack 
theoretical performance guarantees, so it is unclear whether their use is universally appropriate. 

Here, we give an algorithm to correctly reconstruct, with high probability, phylogenetic trees that come 
from a Markov model of evolution, in sub-quadratic time using sequences of O(logn) length. King has proved 
an Q(n 2 ) lower bound on the running time of all distance-based phylogeny reconstruction algorithms 
using such short sequences; our algorithm avoids the problem by using the sequences directly, rather than 
relying only on distance calculations. 

Our algorithm is based on three ideas. First, we use locality-sensitive hashing[T3] to find sequences that 
are near-neighbours in the tree, in sublinear time. This hashing is a first step in choosing which two positions 
should be joined in the tree we incrementally are building. Second, we use reliable estimates of distance, 
to identify exactly the correct join at each step; this step involves some hoary computation, due to the 
need to ensure that inferred sequences are independent estimates. And finally, we use ancestral sequence 
reconstruction to reliably approximate the sequences found at internal tree nodes. Since we start with a forest 
with each taxon in its own tree, and perform this joining step until only one tree remains in the forest, the 
overall runtime is sub-quadratic. Specifically, if p is an upper bound on the mutation probability on any edge, 
and p < 1/2 — y/l/8, then we show that we can do the locality-sensitive hashing, which is the runtime- 
determining step, at each step in 0(n 7 ' p ^ log 2 n) time, where j(p) is always less than 1; the overall runtime 
is thus 0(n 1+ ^P) log 2 n). 

2 Related work 

Our work spans two recent threads in phylogenetic research: theoretical algorithms with guarantees of per- 
formance, and practical algorithms with no guarantees of performance. 

2.1 Principled phylogenetic algorithms 

Erdos et al. [5] gave an 0(n 4 logn) algorithm that reconstructs a phylogeny with high probability, assuming 
the Cavender-Farris model of evolution, for sufficiently long sequences. For most trees, their algorithm runs 
in O(n 2 polylogn) time and requires O (poly log n) sequence length. Csuros [4 provided a 0(n 2 ) algorithm 
with similar performance guarantees. Recent papers [1315] give similar algorithms to identify parts of the 
tree that can be reconstructed. These approaches use quartet queries chosen so that, with high probability, 
only correct quartets are queried. The only sub-quadratic time algorithm with guarantees on reconstruction 
accuracy is by King et al. [15] : for most trees, its running time is 0(n 2 '"f^" ) provided that the sequences 
are O (poly log n) in length. 

King et al. also showed that any algorithm that reconstructs the true tree with high probability, and uses 
distance calculations as its only source of information about the phylogeny, will have Q(n 2 ^j^" ) running 
time, for sequences of length O(polylogn). 

Mossel [19] gave a phase transition for phylogenetic reconstruction. Suppose we have a balanced phylogeny 

with all edges having mutation probability p. If p is less than \ — s» 0.15, then the phylogeny can be 

reconstructed correctly from sequences of length O(logn) sequence data. For larger p, sequences must be 
of polynomial length to allow constant probability of reconstructing the phylogeny. Daskalakis, Mossel and 
Roch |6j extended the phase transition result to unbalanced trees, and provided an 0(n 5 ) algorithm that 
reconstructs phylogenies with lengths below the phase transition. Mihaescu, Hill and Rao jTSj provided an 
0(n 3 ) algorithm for the same problem, which bears some resemblance to our approach. 

2.2 Practical algorithms 

In parallel to these theoretical results, many researchers have developed phylogeny reconstruction algorithms 
that can analyze alignments of tens, or even hundreds, of thousands of taxa. Fast Neighbour Joining [8] 
runs in 0(n 2 ) time and gives results similar to neighbour joining, which requires 0(n 3 ) time. FastTree [20] 
reconstructs phylogenies without computing the full distance matrix, which results in 0(n 15 log n) runtime. 
Our own recent algorithm, QTree |1I2) . runs in O(nlogn) time, using an incremental approach to building 



trees. No theoretical guarantees exist for the quality of solutions obtained from these fast algorithms, however, 
under realistic assumptions about the evolutionary model, for short sequences. 

3 Preliminaries 

3.1 Phylogenetic trees 

A phylogeny is an unrooted, weighted tree whose internal vertices all have degree 3 and whose leaves represent 
extant taxa. The weights represent evolutionary time. Evolution is modelled as a time-reversible Markov 
process operating on the edges of the tree, where each position of a sequence evolves independently of all 
others. 

A quartet is a phylogeny on four taxa. For a set {a, b, c, d} of taxa, there are three possible quartets which 
we denote as ab\cd,ac\bd and ad\bc. 

We assume the Cavender-Farris model of binary character states over ±1 evolving according to a contin- 
uous Markov process. Each edge e is labelled with length lie), and the probability that the ends of e have 
different states is p(e) = Ml — exp(— 21(e))]. If two sequences differ in a p fraction of sites, the maximum 
likelihood estimator of the distance between them is d = — \ log(l — 2p). 

We assume there exist constants / and g such that for each edge e in the phylogeny, we have / < 1(e) < 
g < In 2/4. This gives a minimum length for each edge, and also gives each edge state change probability less 
than 1/2 — -y/l/8, which guarantees a bounded probability of error when reconstructing ancestral sequences [BJ. 
We also assume that all edge lengths are multiples of some constant A, consistent with previous work [6j. 
With these assumptions in place, a surprising fact arises: with sequences of length O(logn), we can exactly 
identify the tree distance between close nodes in the phylogeny |6ll8j . 

Theorem 1. Let A < f < g' < oo. Then there exists a constant c(f ,g',A) such that if f < d(a,b) < g' 
and d(a, b) is a mutliple of A, we have 

Pr[\d(a,b) - d(a,b)\ > ~] < exp(-fc/c(/', g' , A)) 

where k is the sequence length. 

In particular, if k = 3c(/', g', A) Inn, we can identify the correct distance with probability at least 1 — n~ 3 . 

Note also that this theorem applies to any distances in our trees below a constant times g, the upper 
bound on a single edge length. 

3.2 Locality-sensitive hashing 

Our algorithm requires finding pairs of sequences within a specified small distance from each other, without 
having to compute all pairwise distances. Indyk and Motwani |14j solved this problem using a collection 
of randomized hash tables: enough hash tables are chosen so that close sequences likely collide in one of 
the tables, while keys are long enough that distant sequences do not. This idea, known as locality-sensitive 
hashing, has been applied to many problems in bioinformatics, such as motif finding [3]. 

Specifically, Indyk and Motwani solve a related problem, the (r\, ^-approximate Point Location in Equal 
Balls ((ri,r 2 )-PLEB): 

Input: A set of sequences P in {0, l} d , a query sequence q, and radii r\ < r 2 

Output: If there exists a sequence p £ P within normalized Hamming distance r\ from q, output "yes" 
and a sequence within r 2 of q. If there is no sequence in P within normalized Hamming distance r 2 from q, 
output "no" . Otherwise, output either "yes" or "no" . 

Indyk and Motwani's solution constructs n ri ^ T2 hash tables, each keyed on O(logn) randomly chosen 
sequence positions. Given q, a point within distance r\ of it has a constant probability of colliding with it q 
each hash table, while points further than r 2 from q have 0(l/n) probability of colliding. After inspecting a 
constant number of collisions with q, we can find, with constant probability, a point whose distance from q is 
at most r 2 ; if we boost by running O(logn) times independently, the success probability is 1 — n~ a , for any 
choice of a. For more details, see |14) . Overall, finding an (ri, r 2 )-approximate near neighbour for a query 
point q with high probability takes 0(n ri ^ r2 logn) hash table lookups, each on a key of length O(logn) bits. 

The reason Indyk and Motwani solve the approximate PLEB problem is that their hash table solution may 
be overwhelmed by points within the region greater than r\ but less than r 2 away from q. In our domain, we 
can avoid this problem, and find all sequences within distance exactly r from a given query: we choose r 2 to 
be small enough that at most O(logn) points are found within even the r 2 distance, so we can examine all of 
them and still have fast runtimes. We do this by choosing r 2 to be 1/2 — l/2(exp(— 2c/ log log n) , the relative 



Hamming distance corresponding to all sequences within evolutionary distance c/loglogn, for a constant c 
that incorporates errors arising from reconstructing internal sequences of the tree (see Section 3.3 and the 
Appendix for details). In log log n edges, we can reach O(logri) nodes. The distance r^ converges to 1/2 as 
n grows (though it is quite a bit smaller for smaller values of n), so in the limit, the number of hash tables 
grows to n 2ri . 

Finding all neighbours within r\ normalized Hamming distance thus takes 0(n 2ri+e log 2 n) time with high 
probability, where e — > as n increases: we use 0(n 2ri+e logn) hash tables, each of which requires O(logn) 
time to examine, and we take 0(log 2 n) time examining the hash table hits. 

In what follows, we will use the hashing algorithm to find sequences within evolutionary distance d, 
implicitly relying on the simple correspondence between Hamming and evolutionary distances outlined in the 
previous section: our procedure FindAUC'lose(q, d) finds all sequences within evolutionary distance d of q 
with probability 1 — o(l/n 3 ), and so with high probability makes no errors during the course of running our 
entire algorithm. 

3.3 Four-point method 

To identify the correct place to join two trees, we will use the four-point method, which reconstructs quartets 
from the six pairwise distance estimates. The method computes d(a,b) + d(c, d) , d(a, c) + d(b, d) , and d(a, d) + 
d(b,d), and outputs ab\cd, ac\bd or ad\bc, respectively, depending on which is the minimum. If all pairwise 
distances were estimated exactly, the two sums corresponding to incorrect topologies would both be 2£{e) 
greater than the sum corresponding to the correct topology, where e is the middle edge of the true quartet. 

Because in our setting, we can estimate distances exactly with high probability, we can assume that 
all quartets are properly computed, provided that the sequences used to compute them satisfy the error- 
independence property, defined in the next subsection. 

Theorem 2. Let f and g be the upper and lower bounds on the edge length in a quartet tree. Then there 
exists a constant c%(f,g, A) such that we can reconstruct the lengths of the edges of the quartet exactly using 
sequences of length c% (/, g, A) log n with probability at least 1 — \. 

Proof. The claim follows from Corollary 1, by setting /' = 2/ and g' = 3g. 

3.4 Ancestral states 

When all edge lengths are below the phase transition threshold, we can correctly infer the ancestral state of 
a character at any internal node of the tree with probability greater than ~ + ft for some constant /3 [1111916) . 
Using this observation, we identify nodes which should be near-neighbours in the tree, join them together, 
and infer new ancestral node sequences, until we have only a single tree remaining. The following theorem, 
which is an adaptation of a result by Daskalakis et al. [S] , describes the probability of correctly reconstructing 
ancestral states. 

Theorem 3. Let T be a binary tree with root p and edges e all satisfying p[e) < | — \J\- Let ST denote the 
leaves of T . Let a be a Cavender-Farris character on T. The maximum likelihood algorithm A for ancestral 
state reconstruction computes the ancestral state with probability satisfying 



for some constant c independent of T . 

The maximum likelihood algorithm [12] computes the posterior distribution of a state at an internal node 
based on the previously computed posterior distributions at its children. For constant-sized alphabets, this 
can be done in 0(1) time. We then pick the state of largest posterior probability. If the true edge lengths are 
known, this algorithm has the optimal probability of correctly reconstructing the ancestral state, among all 
possible algorithms. 

The proof of [S] used the recursive majority algorithm, making the parameter /3 hard to estimate. We need 
an upper bound on j3 to bound the running time of our algorithm. The following result was proved by Evans 
et al. jlT] . 

Theorem 4. Let T be a phylogenetic tree where all mutation probabilities across edges are bounded by p q < 



Pr[A(a ST ) 




1] = Pr[A(a 5T ) 




l]> 2+ 0. 




Assign to each edge e a resistance (1 — 2p) 2 ' e ' 7 where \e\ is the number of edges on the path from 




Fig. 1. Independence relationships between sequences reconstructed using Fitch's algorithm: in the left tree, sequences 
A and B are error-independent. In the right tree, A and B are not error-independent, but C and B are error- 
independent. The algorithm was run on subtrees drawn in solid lines. 



root to e, including e itself. The probability p err of incorrectly reconstructing the state at the root of the tree 
is bounded by 

1 1 

Perr < 2 ~ l + n^f 

where Tteff is the effective resistance between the root and the leaves of T. 

This bound is quite loose. For this reason, we will use a better bound, originally developed for the simpler 
Fitch parsimony algorithm. The following bound is sharper for p q < 0.118. 

Theorem 5. Let T be a phylogenetic tree where all mutation probabilities across edges are equal to p q < |. 
The probability p err of incorrectly reconstructing the state at the root of the tree using Fitch parsimony is 
bounded by 

Perr K 2 2(1 - 2p g f < Pg 

Note that the original result was stated to hold for variable edge lengths. This was corrected by Zhang et 
al. [21]. 

The above bound applies to the maximum likelihood algorithm, even for variable edge lengths, as long as 
they are bounded by p q . For constant length edges, the result holds by the optimality of maximum likelihood. 
Now suppose that we shrink an edge in tree T, creating a tree T' . The mutual information between the leaves 
and the root of T" is greater than the mutual information between the leaves and the root of T, meaning that 
the probability of reconstructing the root of T' must be higher. Applying this argument to all edges except 
the longest, we obtain that the above bound holds for trees of variable edge lengths when the maximum 
likelihood algorithm is used. 

Knowing the bound on p err is important in our algorithm because we will use locality-sensitive hashing on 
these sequences, and we need to know the number of hash tables required. Better bounds on p err will result 
in faster algorithms with the same performance characteristics. Let g err be the distance corresponding to a 
mutation probability of p err . 

Suppose that we reconstruct ancestral sequences for subtrees Ti,T 2 of T, rooted at p\,p 2l respectively. 
Moreover, suppose that the path connecting T\ and T 2 has ends p±, p 2 (see Figure]!]). By the Markov property, 
reconstructing o~ Pl correctly is independent of reconstructing <r P2 correctly. We call such two sequences error- 
independent. The distance estimate d(o~ Pl , o~ P2 ) will converge to g + gi + .92, where g\ and g 2 are edge lengths 
corresponding to the probabilities of incorrectly reconstructing states in the two sequences. When comparing 
independently reconstructed sequences, we can effectively treat these errors in the reconstructed sequences as 
"extra edges" [6], whose length can be bounded using Theorem 4. This observation has been used extensively 
in many theoretical algorithms |6I19I18) . 



Theorem [3] combined with Theorem [2] and the Markov property enable us to estimate internal branch 
lengths from reconstructed ancestral sequences, and also correct quartets. 

Theorem 6. Let i,j,k,l be ancestral sequences reconstructed by maximum likelihood from four disjoint sub- 
trees, and such that no path between any two of them in the true phylogeny shares an edge with any of the 
subtrees. Let f and g' be the upper and lower bounds on the edge lengths in the quartet ijkl. Then there exists 
a constant c{f ,g', A) such that given reconstructed ancestral sequences of length clogn, we can estimate the 
length of the middle edge ofij\kl exactly, with probability 1 — %. 

4 The algorithm 

The algorithm starts with a forest F of n trees, each with one taxon. It then progressively merges subtrees 
into larger subtrees of the true tree, by finding two nodes that are quite close using locality-sensitive hashing, 
identifying where they should be joined, and inferring ancestral sequences. The underlying idea, however, is 
complicated by the requirement in Theorem [6] that the sequences be reconstructed from disjoint subtrees of 
the true phylogeny. 

Algorithm [T] presents a simplification of the algorithm. Our algorithm maintains four invariants: 

1. Every tree in F is a subtree of the true tree T 

2. No two trees in F overlap as subtrees of T 

3. For each tree T' in F, all edges except at most one have length at most g. The remaining edge has length 
at most 2g. We call it a long edge. 

4. The length of every path in every subtree in F is reconstructed correctly 

The first three invariants are the same as in the work of Mihaescu et al. [18j ; the fourth, we maintain using 
Theorem [6] The invariants, together with routine CheckErrorLndependence, ensure that its preconditions 
are satisfied. 



Algorithm 1 SimplifiedReconstruct({er}, /, g) 
Start with a forest with each node in its own tree. 

Use locality-sensitive hashing to find all sequences whose pairwise distance is less than 3g + 2g err . Put them in a 

priority queue, DistQueue. 

while the forest has more than one tree do 

Find two sufficiently close nodes x and y that are not currently in the same tree of F. 

Identify the nearby edge to x and (k, I) to y that should be joined to connect the trees containing x and y 
in the forest 

Create two new nodes, a and b in the middle of (i,j) and (k,l), and join a and b together with a new edge. 
Estimate the lengths of all five edges in the quartet ijkl. 
Reconstruct the ancestral sequences at a and at b. 

Find all sequences within 3g + 2g err of the newly inferred sequences, and add these distances to DistQueue. 
end while 



In what follows, we will expand the details of this algorithm, focusing on ensuring invariants 3 and 4. We 
will assume the existence of three procedures: FindAUClose(q, d) uses locality-sensitive hashing to identify all 
sequences within relative Hamming distance d of q. Quartet(a, b, c, d) uses the four-point method to identify 
the correct topology of the quartet abed. And MiddleEdge(ab\cd) computes the length of the middle edge in 
the quartet ab\cd. Assuming the preconditions to Theorem [3j these procedures work with high probability. 

Most of the subroutines are presented for the case where all their arguments are internal nodes. The cases 
where some nodes are leaves are analogous; we omit them for brevity. 

We will often treat subtrees with long edges as rooted, with the root located somewhere on the long edge. 

4.1 Independent inferences 

If the reconstructed sequences used to perform quartet queries are not independent, the quartet middle edge 
length estimates and the inferred quartet topology might be incorrect. This could lead to a wrong choice of 
which edges to join. 

The order in which ancestral sequences are reconstructed defines a partial order of the nodes in F. We 
call it the reconstruction order. For two nodes a and x in F, at least one of the children of b with respect to 



Algorithm 2 ChcckErrorIndependence(x, y, a, b) 

Require: a and b are error-independent, x and y are error-independent, T(a) and T(x) do not overlap 
for all 2 G {a, 6, x, y} do 

Let Zi, z 2 be the children of z in reconstruction order(if they exist) 
end for 

d <— MiddleEdge(xy\ab) 
for i,j e {1,2} do 

d ai bj <s— MiddleEdge(xy\a,ibj) 

dxiVj <— MiddleEdge(xiyj\ab) 
end for 

if any of the d ai & or d Xi yj differs from d by more than Z\/2 then 

return false 
end if 
return true 



reconstruction order is not on the path from a to x in T. Check Error Independence uses this observation to 
detect cases when lack of error- independence impacts the middle edge length estimate for the quartet. 
Let ME{xy\ab) denote the true length of the middle edge in the quartet xy\ab in T. 

Lemma 1. If \MiddleEdge(xy\ab) — ME(xy\ab)\ > A/2, then CheckErrorlndependence returns false. Oth- 
erwise, it returns true with high probability. 

Proof. Let T(o) and T(x) be the subtrees that contain a and x, respectively. Sequences a and b are independent 
and lie on the opposite sides of edge e. It follows that if x and y join the tree at e, we have ME(xy\ab) — 
ME(xy\aibj) for any choice of i and j. If x, y, a, b are independent, then all the middle edge estimates are within 
A/2 of each other and correct within A/2. Now suppose some two of these sequences are not independent 
(say a and x). Without loss of generality, that means that T{x) joins T(a) at some edge in subtree of T(a) 
consisting of all nodes based on which the sequence at a was reconstructed. It is easy to see that one of the 
sequences ai,a 2 is then independent of x, in which case its corresponding call to MiddleEdge will return a 
value that is correct within A/2. 

4.2 Detecting overlapping subtrees 

To maintain Invariant 2, we need to prevent merging two trees if the new edge created between them overlaps 
some other subtree in F. The procedure CheckOverlaps detects overlapping subtrees. We use R(T(x)) to 
indicate the set of sequences in T(x) and dr> for the path metric associated with T". 



Algorithm 3 CheckOverlaps (x,y) 

S = (FindAUClose(x, 2g + 2g err ) U FindAUC'lose(y, 2g + 2g err )) - R{T{x)) 
for each sequence a in S do 

for each sequence b in T(a) that is independent of a and such that dx( )(a, b) < 5g do 
if Quartet(a,b,x,y) ^ ab\xy and CheckIndependence(Quartet(a,b, x,y)) = true then 

return true 
end if 
end for 
end for 
return false 



Lemma 2. If the edge (x, y) overlaps some other edge in F and the sequences at x and y are independent, 
CheckOverlaps will return true. Otherwise, it will return false. 

Proof. Suppose T(x) overlaps with some other subtree T' . One can easily show by case analysis that there 
exists a reconstructed sequence in T' that is within distance 2g + 2g err from x or y and such that the two 
sequences are independent. Since T(x) overlaps with T' , there has to exist a node b' in T' such that the 
induced topology on x,y,a,b' is ax\b'y (w.l.o.g. we assume that a forms a clade with x. The reconstructed 
sequence at b' need not be independent from a, x and y, but then there must be a node b within distance 2g 
from b' whose reconstructed sequence is independent of the other sequences, which means ax\by will pass the 



independence test. On the other hand, if T(x) does not overlap with any other subtree, all quartet queries 
that pass the independence test will return ab\xy. 

Note that searching for b can be done by breadth- first search on T(a), without resorting to the nearest 
neighbour search algorithm. 

4.3 Three-way ancestral sequence reconstruction 

In order to connect edges from different subtrees, we need to have independent sequence reconstructions at 
both ends of the new edge. To ensure this can be achieved, we will maintain, where possible, three separate 
sequence reconstructions at each internal node of the subtree, each based on two subtrees of T(x) created by 
removing x, but independent of the third subtree. For any edge e in a subtree, we refer to the two independent 
sequences at its ends as companions. 

Theorem [3] is only applicable when all subtree edges have length less than g. For a subtree with a long 
edge, we treat the subtree as rooted at a node on the long edge with a single sequence reconstruction. When 
that node joins to another tree, we then create new three-way reconstructions; such a tree can only be joined 
with another tree via the long edge, in order to maintain Invariant 3. 

The routine ThreeW ay Reconstruction takes a tree with sequences reconstructed by maximum likelihood, 
adding to each vertex the two remaining reconstructions of its sequence. It must be started from a node that 
has no successors in reconstruction order. 



Algorithm 4 ThreeWayReconstruction(r) 
Let x,y,z be the neighbours of r. 

Reconstruct sequences <T X y{r),a vz (r),a xz (r) conditioned on {x, y},{y, z} and {x,z}, respectively. 
Let S be the set of vertices in T(r) with only one sequence reconstruction. 

Visit vertices in T(r) in Breadth-First Search order, reconstructing each sequence conditioned on all choices of 2 
neighbours. Stop branching if a node visited during an earlier call of ThreeW ay Reconstruction is encountered. 



4.4 Long edges must be joined 

Again, to maintain Invariant 3, when either of the two closest sequences is found in a subtree with a long 
edges, we must break that edge. Because we will be finding the shortest pairwise distance between trees in F, 
we will certainly find one of the endpoints of the long edge, but we must not consider its other neighbouring 
edges. 

Procedure CandidateEdges(x) identifies valid edges that can be broken, given a node x in the tree. 



Algorithm 5 CandidateEdges(x) 
if x is the root of a tree T in F then 

Return the set containing the edge e that contains x. (This may be a long edge.) 
else 

Return the set containing all edges in T within distance 3y + 2g err . (This can be determined by a breadth- first 
search.) 
end if 



4.5 The full algorithm 

With these minor issues resolved, we can present the full algorithm. 

Lemma 3. At any point during the execution of the algorithm, there always exists a pair of subtrees that can 
be merged. 

Proof. There are two cases when two subtrees with internal nodes within distance 2g from each other cannot 
be merged. One is when one of the hits is not a root and ThreeW ay Reconstruction has not been called on 
its subtree. The other is when merging two trees would give rise to a tree with two long edges. 

Let us consider the second case first. For each two subtrees T 1; T 2 with that property, remove them from T 
together with the path connecting them. This gives rise to a forest F\ where at least two components border 



Algorithm 6 Reconstructed}, /,#) 

Start with a forest F where each sequence is in its own tree. 
Initialize hash tables for nearest neighbour search 

Use FindAUClose to identify all sequence pairs at distance less than Zg + 2g err ; put these in a queue DistQueue. 
WaitList <- 

while F has more than one tree do 
(x, y) <s— DistQueue. popQ) 
if T(x) = T(y) then 

continue 
end if 

X 4— CandidateEdges(x) 
Y CandidateEdges(y) 
Joins <-Ix7. 
if X = or Y = then 

WaitList. add((x, y)) 
end if 

Filter Joins to only include pairs ((i,j), (k, I)) where Independent^, j\k, I) 

Use Quartet and MiddleEdge to find d m in, the smallest middle edge length among Joins. Let it be the result of 
joining [i*,]* 1 with (fc*,P). 

create an edge between the two edges (i*,j*) and (k* ,1*) that give rise to d m in 
use MiddleEdge to calculate the lengths di,. . . ,d& of the five new edges 
if max; di > 2g or di > dj > g for some i ^ j or CheckOverlaps(x,y) then 

undo this loop iteration 

WaitList. add((x, y)) 
end if 

if the new tree has a long edge then 
create a root on the new long edge 
else 

reconstruct the sequence at the new internal nodes n , ri 
end if 

if the new tree has no long edge then 

ThreeW 'ay Reconstruction^- 1 ) 
end if 

Use FindAUClose to add all newly-created sequence pairs whose distances are below 3g + 2g err to DistQueue. 
For all hits (x, y) in WaitList at distances less than 3<? + 2g err from any of the newly created sequences, move 
(x, y) from WaitList to DistQueue 
end while 



exactly one of the removed subtrees. Call one such component T" and let T" be its unique adjacent removed 
subtree. Let e' be the edge between T" and T". 

If T' contains a reconstructed subtree that is incident to e' and can be joined with the subtree at the 
other end of e', we are done. Otherwise, we will show that there exist two subtrees in T' that can be merged. 
Consider the forest F c = T' — F. We will refer to components of F c as antitrees to avoid confusion with 
reconstructed trees from F. It is easy to see that since F cannot contain leaves that are internal nodes in T, 
each antitree in F c is either a single edge or contains a cherry. Therefore, each antitree in F c contains two 
leaves at distance less than 2g. 

Take any such two leaves x and y from antitree T-f . If they cannot be joined, then at least one of the trees 
(say T(x)) has a long edge that doesn't include x. The root r at this long edge belongs to an antitree T% that 
must be different from T-f (otherwise we would have a cycle in T) . T% has at least one pair of leaves within 
distance 2g. If they cannot be merged, then one of its leaves is incident to another tree in F with a long edge. 
Its root is incident to another tree T§ in F c . The claim holds by induction on the size of F c . 

Theorem 7. Each iteration of the while loop maintains invariants 1-4- 

Proof (Sketch). Invariants 1 and 4 are maintained due to Theorem [6] and the fact that every quartet query 
is checked for independence. Invariant 2 is maintained by CheckOverlaps. Invariant 3 is maintained by the 
conditions on d m %n m the main loop. 

4.6 Runtime analysis 

At each iteration of the while loop, all the operations except nearest neighbour search (which is invoked a 
constant number of times per iteration) take constant time, given that the ratio g/f is constant (and thus 
the number of nodes within distance g of any node is a constant whose size is 0(2 9 ^). The complexity is 
therefore dominated by the use of the hash tables. The evolutionary distance 3<? + 2g err corresponds to a 
Hamming distance of at most 

h =\- \(l-2 Pg ) 3 (l-2 Perr ) 2 

where p g = — | — - and the bound on p err is given by Theorem [i] FindAUClose is used a constant number 
of times per loop, so the running time of each iteration of the loop is 0(n 2h+e log 2 n) . The loop is run 0(n) 
times, since the number of sequences within distance 3g + 2g err of any sequence is constant. Overall, the 
runtime is bounded by 

Cn 2-(l-2 Ps )3(l-2p OT ) 2 + eiog 2 n < Cn 2-(l-2 Ps f ^X;^ lQg 2 „ < Cn 2-(l-2 Pg f (Sg~lf +e log 2 n 

which is always o(n 2 ). Table 1 shows the runtime for selected values of p g . 



Table 1. The approximate runtime for different values of p g 
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5 Experiments 

5.1 A practical algorithm 

Many assumptions made by the above theoretical algorithm cannot be met in practice. The number of hash 
tables required for FindAUClose to work with high probability requires a prohibitive amount of memory on a 
standard desktop computer. Using maximum likelihood for ancestral sequence reconstruction also requires a 
large amount of memory to store conditional probabilities. This motivated us to develop a simpler and more 
memory-efficient practical algorithm. 



Our implementation uses a number of hash tables required to find near neighbours with constant probabil- 
ity, not high probability. For reasons of memory efficiency, we also do not perform three-way reconstruction; 
instead, we join non-root nodes of different subtrees without requiring the sequences in quartet queries to 
be error- independent. Note that we still require error-independent sequences for estimating branch lengths 
in an existing subtree. After two subtrees are joined, the sequences in the smaller subtree are re-estimated 
according to an ordering compatible with that of the larger tree. 

For simplicity, the practical algorithm does not use routines CheckOverlaps and CheckErrorlndependence. 
Instead, we perform a small number of Nearest Neighbour Interchange (NNI) operations after each join, to 
ameliorate the problems originally addressed by these two functions. After an edge has been added, we re- 
estimate the length of all edges whose length might have been affected by the merge. If any quartet gives a 
topology that is not consistent with the edge, we perform an NNI operation to fix the topology and re-estimate 
the lengths of adjacent edges. While this process might repeat several times, it is not equivalent to traditional 
local search algorithms using NNI's, as only the edges in the close vicinity of the new edge are affected and 
the associated computational cost is much lower. 

The algorithm tries to merge pairs of trees, starting from collisions with the lowest estimated evolutionary 
distance. If no collisions are found within distance (rj + r2)/2, a new hash table is added, until the maximal 
number of 2n Tl ^ T2 hash tables is reached. 

The current version of our algorithm does not attempt to find optimal LSH parameters r% and r%. In our 
experiments, we set them to 0.2 and 0.6, respectively. This choice leads to memory inefficiency for trees with 
very short edges, as we will see later. We leave the automatic adjustment of these values for future work. 

5.2 Data sets 

We used a data set from our previous paper, where we presented QTree ;2|, to compare our new algorithm 
with two other fast phylogenetic algorithms, Neighbour Joining and Fast Tree. We simulated 10 trees on 20000 
taxa from the pure-birth process. We then multiplied the length of each branch by a factor chosen uniformly at 
random from interval [0.5, 2] to deviate the trees from ultrametricity. This methodology follows the previous 
work of Liu et al. jH] . We then scaled the branch lengths of the entire tree by several choices of constant 
factors. For each choice of tree and scaling factor, we generated alignments whose length varied between 250 
and 4000 positions. No indels were introduced in the simulation. 

5.3 Results 

Figure 2 shows the performance of our algorithm, compared with QTree and Fast Tree. We use the Robinson- 
Foulds metric, defined as the fraction of splits in the true tree that are found in the reconstructed tree. To 
ensure fair comparison, we only ran the Neighbour Joining phase of FastTree, without its concluding local 
search phase. 

Our algorithm achieves higher accuracy than both FastTree and QTree in most settings. The main ex- 
ception is trees with long branches, where its accuracy is substantially lower than both QTree and FastTree. 
The poor performance of our algorithm for trees with long branches is not surprising given that it relies 
so heavily on reconstructed ancestral sequences, whose accuracy diminishes as branch lengths approach the 
phase transition. For very short sequences, FastTree appears somewhat more accurate, possibly because of 
aggregating information from a greater number of distance estimates. 

5.4 Running times and scalability 

On most instances, our program runs in times competitive with FastTree and somewhat longer than QTree 
(see Table 2). We believe the running times could be improved by a more careful choice of parameters, and 
note that our work is a preliminary prototype. 

For 32-bit machines with up to 4GB RAM, our program does not scale to alignments larger than 2 ■ 10 7 
letters. This is mostly due to the amount of memory required to store probability vectors for maximum 
likelihood, but also due to the hash tables. Memory usage may also increase if the number of collisions is 
high, since these are stored in a priority queue. 

For trees where average branch length is very low compared to the ri parameter, vast numbers of collisions 
are generated, which leads to a substantial increase in running time and memory usage. We partially mitigate 
this problem by discarding all but top k hits from each hash table entry, but the increase in running time is still 
substantial, sometimes increasing by 3-fold compared to the normal scenario. We plan to solve this problem 
by supporting longer hash table keys and automatically choosing T2 in the final version of the software. 



Table 2. The running times of the three algorithms for three representative data sets. In most cases, our algorithm is 
faster than Fast Tree, but slower than QTree. For very short branches, the number of hash table collisions is very high 
due to T2 being too large, which results in a longer running time for our algorithm. 
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Fig. 2. The performance of the LSH algorithm (red, dashed) compared to QTree(dark blue), and FastTree(light green), 
as a function of the length of the sequences. The four graphs represent the performance on 10 tree topologies with 
branch lengths scaled by constant factors 25, 50, 100, and 200. The accuracy of the LSH algorithm is superior to both 
QTree and Fast Tree in most settings, except for phylogenies with very long branches (scale=200), where the LSH 
algorithm performs substantially worse than the other two, presumably due to poor ancestral sequence reconstruction. 



We also ran our program on the larger simulated 16S data set with 78000 sequences from the Fast Tree 
paper [5D] . We created smaller data sets by randomly sampling 20000 and 40000 sequences from the full data 
set. For the data set with 20000 sequences, our algorithm took 15 minutes, compared with 9 minutes for both 
FastTree and QTree. For the data set with 40000 sequences, our algorithm took 56 minutes, compared with 
26 minutes for FastTree and 19 minutes for QTree. We think that the runtime of our algorithm was impacted 
by the wrong choice of r2, as in previous experiments. Our program ran out of memory on the full data set. 
The accuracies of the algorithms behaved similarly to the other data sets, with our algorithm being more 
accurate than QTree and FastTree. 



6 Conclusions and future work 

We have presented a fast theoretical algorithm that correctly reconstructs phylogenies whose branch lengths 
are short enough. This theoretical algorithm shows the possibility of reconstructing such large phylogenies in 
sub-quadratic time from short sequences, without compromising the accuracy. Our prototype implementation 
achieves accuracies that are comparable or exceed existing algorithms, while also offering competitive running 
times for instances of a few tens of thousands of taxa. We believe that both the accuracy and the running 
time of the algorithm could be improved further. 

This work could be improved in several ways. On the practical side, we plan to improve the scalability 
of the algorithm by using a more flexible and memory-efficient hash table implementation. The applicability 
of our algorithm to diverse evolutionary scenarios will require the ability to set the hash table parameters 
automatically. We also plan to investigate how the runtime of LSH algorithms could be improved by taking 
advantage of rate variability across sites, which may also offer opportunities for higher accuracy in cases where 
long branches are present. 

Some theoretical questions also remain. Fclscnstcin's algorithm is known to have optimal probability of 
reconstructing ancestral states correctly, but this probability appears hard to estimate (see e.g. [T7]). Getting 
tighter bounds on reconstruction accuracy would lead to an improved running time of our algorithm. Another 
avenue for improvement is using a faster locality-sensitive hashing scheme. Dubiner [7] has recently proposed 
such a scheme for very long sequences, but it is not clear whether it will be useful with sequences of only 
logarithmic length. 
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A The effect of reconstruction errors on locality-sensitive hashing 

If two reconstructed ancestral sequences are not error-independent, the distance estimate between them might 
be biased. Here, we show that this has no effect on the running time of FindAllClose and its ability to find 
all sequences within specified evolutionary distance. The following lemma is a direct consequence of Lemma 
5.4 by Mihaescu et al. |T5] . 

Lemma 4. If nodes a and b are not error-independent and their true evolutionary distance is d, d{a, b) < 
d + 2g err with high probability. 

This means that the lack of error-independence will not cause LSH to miss internal nodes within specified 
evolutionary distance. On the other hand, it could happen that the bias from error-dependence generates 
additional hits in hash tables. The following lemma shows that the number of additional hits is at most log n. 

Lemma 5. let 1/2 — l/2(exp(— 2c/loglogn) with c < ln2 in the LSH algorithm. The number of sequences 
b such that d(a,b) > cf log log n and d(a,A(b)) < cf log log n is at most logn. 

Proof. We use a well-known equivalent formulation of Markov chain on trees as a percolation process. For 
each edge in T, we set if to open with probability 1 — 2p and closed otherwise. Each connected component of 
open edges shares the same state and states in different components are independent. Notice that conditioned 
on there being a closed edge between a and b, a reconstruction error in b is independent of the state at a. 
If the true evolutionary distance between a and b is at least / log logn, the probability of them being in the 
same component is at most (log rt) -2 ^. Consequently, the expected normalied Hamming distance between a 
and b is at least 

1 -2f 

- - (logn)T^ + (logn) _2/ 

Picking c < In 2 ensures that this is bounded away from r2 for n large enough. This, together with the fact 
that there are at most logn sequences within distance / log logn of a, concludes the proof. 



